Neel Nanda

Google DeepMind

Subliminal Learning Is Steering Vector Distillation

May 31, 2026
Via arXiv

How Well Do Models Follow Their Constitutions?

May 22, 2026
Via arXiv

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Mar 05, 2026
Via arXiv

Simple LLM Baselines are Competitive for Model Diffing

Feb 10, 2026
Via arXiv

Emergent Misalignment is Easy, Narrow Misalignment is Hard

Feb 08, 2026
Via arXiv

What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering

Jan 28, 2026
Via arXiv

Building Production-Ready Probes For Gemini

Jan 16, 2026
Via arXiv

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

Dec 10, 2025
Via arXiv

Thought Branches: Interpreting LLM Reasoning Requires Resampling

Oct 31, 2025
Via arXiv

Steering Evaluation-Aware Language Models To Act Like They Are Deployed

Oct 23, 2025
Via arXiv